fix(proxy): SSH close race that drops agent's exec reply (EOF on session.Output)#41
Merged
Merged
Conversation
…ng srcChan
When sshHandleChannel processes a session, two independent forwarders
run in parallel: agent-to-upstream requests (e.g. exec, env, pty-req)
go through sshForwardChannelRequests(srcReqs, dstChan), and
upstream-to-agent requests (e.g. exit-status) go the other way. The
agent-to-upstream loop is NOT awaited before sluice closes srcChan.
A fast upstream that, in response to exec, immediately replies + writes
data + sends exit-status + closes lets sluice's wait on the three
upstream-to-agent goroutines complete and close srcChan while the
agent-to-upstream forwarder is still mid-flight: it has received the
upstream's CHANNEL_REQUEST_SUCCESS reply via dstChan.SendRequest but
has not yet called req.Reply on the agent side. The agent then receives
SSH_MSG_CHANNEL_CLOSE before its session.SendRequest("exec", true, ...)
sees the SUCCESS message on ch.msg. gossh closes ch.msg on
SSH_MSG_CHANNEL_CLOSE, which causes the blocked SendRequest to return
io.EOF (channel.go:603-606). session.Output("cmd") surfaces this as
"exec command via SSH: EOF" even though the upstream replied
successfully.
The fix wraps each iteration of the agent-to-upstream forwarder in a
sync.WaitGroup. sshHandleChannel waits for the WaitGroup to drain after
the three upstream-side signals, then closes srcChan. Any in-flight
exec reply has been fully written to srcChan before the close arrives.
The race was deterministically reproducible on the e2e-linux push-event
runner since commit d27b05e moved srcChan.CloseWrite() out of the data
goroutine, narrowing the window between the upstream's reply landing
and sluice's close. Eight successive main-branch e2e runs passed
before d27b05e; the next two (after PR #38 and PR #40 merges) both
failed at TestCredential_SSHInjection with the EOF symptom.
There was a problem hiding this comment.
Pull request overview
Fixes an SSH channel close race in the SSH jump host proxy where the agent can observe SSH_MSG_CHANNEL_CLOSE before receiving the CHANNEL_REQUEST_SUCCESS/FAILURE reply for WantReply=true requests (notably exec), causing session.Output(...) / session.SendRequest(...) to fail with io.EOF.
Changes:
- Add tracking for agent→upstream per-channel requests and wait for in-flight request replies before closing the agent-facing channel.
- Introduce a new
sshForwardChannelRequestsTrackedhelper and a pre-close barrier (inFlight.Wait()) insshHandleChannel.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Addresses Copilot review feedback on PR #41. sync.WaitGroup.Add and .Wait race when Add is called concurrently with Wait while the counter is zero — the Go runtime panics ("sync: WaitGroup misuse: Add called concurrently with Wait"). The previous tracked forwarder called Add(1) from inside a for-range loop that could enter a new iteration at any moment, racing the main goroutine's barrier.Wait() on a fresh counter. Replaces the WaitGroup with an inflightBarrier struct (mutex + cond + draining flag + counter). enter() returns false once draining is set so the forwarder rejects any further request without bumping the counter, which means drain() can converge even if a new agent request arrives mid-drain. Replying false to a WantReply request unblocks any caller blocked on ch.msg. Adds TestSSHJumpHost_BurstCloseDoesNotDropExecReply, a 50-iteration regression test that drives a no-sleep burst-close SSH server through the jump host and asserts that session.Output never returns EOF. The existing startTestSSHServer used by other ssh_test.go tests papers over the race with a 50ms sleep before returning from the channel handler; the new server omits that sleep so the race is deterministically triggered without the inflightBarrier fix.
…ction per iter Addresses Copilot review feedback on PR #41. 1. The burst-close test server's per-channel goroutine now defers ch.Close, so a request loop that exits without hitting the exec path (early agent close, non-exec-only request stream) still releases the server-side channel and its driving goroutine. 2. TestSSHJumpHost_BurstCloseDoesNotDropExecReply's loop now closes agentSSH and agentConn explicitly and waits on errCh with a 5s timeout per iteration. A leaked HandleConnection goroutine surfaces as a deterministic test failure rather than as silent resource exhaustion that could mask a regression in the close path.
Addresses Copilot review feedback on PR #41. The burst-close test previously discarded the value from errCh, which would let a future regression that produces a non-nil error from HandleConnection teardown pass silently. The select now captures the return and fails the iteration when non-nil, so any unexpected teardown error surfaces explicitly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
sshHandleChannelwaits for the three upstream-to-agent forwarders to drain (data copy, stderr copy, request forwarder) before closing srcChan. It does not wait for the agent-to-upstream forwarder, which is the one that callsreq.Reply(ok, nil)to ack the agent's WantReply=true requests. A fast upstream that responds toexecwith reply + data + exit-status + close in one burst lets sluice's wait complete and close srcChan while this forwarder is still mid-reply.The agent's gossh client, blocked in
session.SendRequest("exec", true, ...)on<-ch.msg, seesSSH_MSG_CHANNEL_CLOSEbefore theSUCCESSreply. gossh closesch.msgon close (channel.go:385), and the blockedSendRequestreturnsio.EOF(channel.go:603-606).session.Output("whoami")surfaces this as "exec command via SSH: EOF".Repro
Deterministic on the e2e-linux push-event runner since d27b05e narrowed the close timing window:
TestCredential_SSHInjectionwith the EOF symptomFix
Track in-flight agent-to-upstream requests with a custom
inflightBarrier(mutex + cond + draining flag).enter()returns false once draining has begun so the forwarder rejects further requests without bumping the counter;drain()blocks until any current in-flight handler callsleave().sshHandleChannelcallsdrain()after the three upstream-side signals but beforesrcChan.CloseWrite()/srcChan.Close().A
sync.WaitGroupwould have been the obvious choice but is wrong here:Addcan race withWaitwhile the counter is at zero (Go runtime panics with "sync: WaitGroup misuse"). The forwarder ranges over srcReqs and can enter a new iteration at any moment, so the mutex+cond pattern is necessary.Test plan
go test ./...(2476 tests across 13 packages)gofumpt -l internal/cleangolangci-lint run ./internal/...cleanTestSSHJumpHost_BurstCloseDoesNotDropExecReply(50 iterations against a no-sleep burst-close SSH server) — passes with fix, deterministically fails withoutTestCredential_SSHInjection— PASS